What is Spark?

Apache Spark is an in-memory, distributed data processing engine.

Spark Features:

  • In-memory computation
  • Open source
  • Cost effective
  • Fault tolerance
  • Supports multiple languages (Python, Scala, Java, and R)
  • Lazy evaluation
  • Batch and near-real-time stream processing
  • Very rich built-in library support
  • Concise APIs: logic that takes 20–25 lines of Java can often be written in a single line of Python or Scala
  • Up to 100 times faster than Hadoop MapReduce (MR) and other traditional systems when data fits in memory
  • Up to 10 times faster on disk
  • Integrates with the Hadoop ecosystem, ETL tools such as Talend and Informatica, and cloud platforms (AWS, Azure, GCP)
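
Lazy evaluation, listed above, can be illustrated without a cluster. The following is a minimal pure-Python analogy, not Spark code: generators stand in for transformations (which only describe work) and `sum` stands in for an action (which triggers it). The names `traced_square` and `calls` are made up for this illustration.

```python
calls = []  # records every element that is actually processed


def traced_square(x):
    """Square x and record that the work really happened."""
    calls.append(x)
    return x * x


numbers = range(1, 6)

# "Transformations": building the pipeline does no work yet.
squared = map(traced_square, numbers)
evens = filter(lambda v: v % 2 == 0, squared)
work_before_action = len(calls)  # nothing has been computed so far

# "Action": consuming the pipeline forces the computation.
total = sum(evens)
work_after_action = len(calls)

print(work_before_action, work_after_action, total)  # 0 5 20
```

In PySpark the pattern is similar in spirit: transformations such as `map` and `filter` return immediately, and the work runs only when an action such as `collect()` or `count()` is called.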

On-Heap vs Off-Heap Memory

| Aspect | On-Heap Memory | Off-Heap Memory |
|---|---|---|
| Location | Inside JVM heap | Outside JVM heap (native memory) |
| Management | JVM garbage collector | Spark (Tungsten engine) |
| Default | Yes | No (needs config) |
| Performance | Slower for large data (GC overhead) | Faster for large data (no GC) |
| Storage Format | Java objects | Serialized binary |
| Use Cases | Small to medium datasets, default Spark jobs | Large datasets, Spark SQL/DataFrame, caching, joins, shuffles |
| Risk | GC pauses, OutOfMemoryError | Native memory leaks if misused |
| Configuration | No extra setup | spark.memory.offHeap.enabled=true and spark.memory.offHeap.size |
| Example | Default caching .cache() | Optimized Tungsten memory, serialized storage |
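
The two off-heap settings named in the Configuration row can be assembled as follows. This is a small sketch in plain Python: the helper names `offheap_conf` and `to_submit_flags` are hypothetical, and the size `2g` is an arbitrary illustrative value, not a recommendation.

```python
def offheap_conf(size):
    """Return the two settings needed to enable off-heap memory."""
    return {
        "spark.memory.offHeap.enabled": "true",
        # The size must be positive when off-heap memory is enabled.
        "spark.memory.offHeap.size": size,
    }


def to_submit_flags(conf):
    """Render settings as --conf flags for a spark-submit command line."""
    return " ".join(f"--conf {key}={value}" for key, value in conf.items())


conf = offheap_conf("2g")  # "2g" is an assumed example size
print(to_submit_flags(conf))
```

The same key/value pairs can also be passed one at a time to `SparkSession.builder.config(key, value)` when the session is created in PySpark.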

Note: when a job uses reduceByKey, the operation involves a shuffle, so Spark creates a new stage (step) at that point.
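
Why a shuffle splits the job into stages can be sketched with a small pure-Python simulation. This is an illustration under simplifying assumptions, not Spark's implementation: the function `reduce_by_key`, its parameter `num_output_partitions`, and the sample data are all hypothetical. In PySpark the corresponding call would be `rdd.reduceByKey(lambda a, b: a + b)`.

```python
from collections import defaultdict


def reduce_by_key(partitions, func, num_output_partitions=2):
    """Simulate reduceByKey over a list of partitions of (key, value) pairs."""
    # Stage 1: combine values per key inside each input partition.
    # No data moves between partitions here.
    combined = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = func(local[key], value) if key in local else value
        combined.append(local)

    # Shuffle: route each key to an output partition so that all partial
    # results for the same key end up together. This is the stage boundary.
    shuffled = defaultdict(list)
    for local in combined:
        for key, value in local.items():
            shuffled[hash(key) % num_output_partitions].append((key, value))

    # Stage 2: merge the partial results for each key.
    result = {}
    for pairs in shuffled.values():
        for key, value in pairs:
            result[key] = func(result[key], value) if key in result else value
    return result


partitions = [
    [("spark", 1), ("hadoop", 1), ("spark", 1)],
    [("hadoop", 1), ("spark", 1)],
]
counts = reduce_by_key(partitions, lambda a, b: a + b)
print(sorted(counts.items()))  # [('hadoop', 2), ('spark', 3)]
```

The second stage cannot start until every input partition has finished its local combine, because any partition may hold values for any key. That dependency is what makes the shuffle a stage boundary.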
